feat(orchestration): introduce orchestration system by Fodesu · Pull Request #408 · memohai/Memoh

Fodesu · 2026-04-26T17:31:16Z

Summary

本 PR 为 Memoh 引入面向长复杂任务的编排执行层，使任务具备可观察、可恢复、局部重试、验证与审计能力，从而提升用户对真实任务托付的信任。

Memoh 已经具备成熟的 chat-first bot runtime：对话、工具调用、记忆注入以及 bot workspace 内的命令执行都可以在现有运行路径中完成。这条路径适合对话式协作和短周期任务。

长复杂任务需要的不只是一次模型回复，而是一个可被管理的运行对象。这个 PR 引入 orchestration runtime，将这类任务从聊天上下文中独立出来：User Bot 仍然负责用户交互，run 自身拥有状态机、worker、verifier 和 Web UI。

本 PR 的 review 主线是 durable run/task execution layer。NATS、blackboard、EnvSession 等内容服务于这条主线，是围绕该执行层建立的初版 runtime contracts，不代表完整 autonomous task platform 已经完成。

Why：长复杂任务需要执行层

长复杂任务不是一次问答，而是一段会随中间结果改变路径的执行过程。它可能拆分出多个依赖步骤，某些步骤失败后需要局部修正，某些结果需要验证后才能进入下一步。对这类任务，聊天上下文适合作为交互入口，但不适合作为执行状态本身。

可观察性

真实任务不能只留下最终回复。用户和维护者需要了解任务被拆分成哪些步骤、当前停留在哪个阶段、哪些 attempt 已完成、哪些 checkpoint 等待处理，以及哪些 artifact 或 env 状态参与了结果生成。

Orchestration 将 run、task、attempt、event、checkpoint、verification 建模为可查询对象。Web UI 提供 DAG、Flow timeline 和 task inspector，使维护者能够回答：任务为什么停在这里、谁持有 lease、哪个 attempt 失败、失败后是否触发了 retry/replan，以及最终结果经过了哪次 verification。UI 在这里是 trust surface，不只是 debug tooling。

可恢复性

长任务需要应对进程重启、lease 过期、worker 离线、verification 中断、人工等待超时等情况。聊天上下文只能恢复叙述，持久化 run state 才能恢复调度和执行进度。

本 PR 使用 Postgres 记录权威状态，通过 worker/verifier lease fencing 阻止过期执行者继续写入，并通过 recovery loop 接管过期的 attempt、verification 和 checkpoint。任务中断后，系统可以判断应继续、重试、等待还是失败。

局部重试

复杂任务通常由多个相互依赖的步骤组成。某个步骤失败时，不应默认丢弃整个 run，也不应要求用户手动判断哪些结果仍可复用。

Orchestration 将任务拆分为 DAG，并将 retry、replan、supersede 表达为显式状态迁移。失败发生在 task/attempt 层，系统可以仅重做需要重做的分支，并保留已经有效的结果。

验证与审计

长任务交给 agent 执行之后，用户需要的不只是"模型声明完成"。结果是否可接受、失败来自工具还是任务本身、谁在何时执行了什么，都应进入 run 生命周期。

本 PR 引入 verifier runtime、verification records、committed run events、runtime facts 和 side-effect approval token，使 agent output 从"最终回答"变成可被 accept/reject 的 claim。验证结果、运行事实和事件记录也会成为后续 eval、资源治理和经验沉淀的输入。

Who：组件职责

User Bot：用户对话入口。可以通过 orchestration tool 发起 run、查询 run、注入 hint、取消 run，但不承担 DAG 调度职责。
API server (cmd/agent)：负责 HTTP API、鉴权以及 Web UI 读路径。不依赖 NATS 启动；mutating orchestration API 会检查 orchestrator 是否在线，控制平面不可用时返回 503。
Orchestrator (cmd/orchestrator)：控制平面 daemon，负责 planning、scheduling、recovery、verification recovery、event outbox、fact consumer 和 env reclaim。
Worker (cmd/workerd) / Verifier (cmd/verifyd)：task-scoped runtime，分别负责执行 attempt 和验证 completed attempt。
Web UI：操作与观测界面，提供 DAG、Flow timeline、task inspector、checkpoint、artifact、execution span 和 env state。

What：这个 PR 引入的运行模型

本 PR 引入一套持久化 run 模型：

run、task、dependency、attempt、result、checkpoint、event、worker、verification
用 orchestration intent 推进 start、replan、checkpoint resume、attempt finalize
append-and-supersede 的 replanning，不原地改历史
worker / verifier claim fencing
committed event outbox + runtime fact envelope
blackboard writer / CAS 规则
env resource、session、binding、snapshot、side-effect approval token
Web DAG 和 Flow 视图

当前实现已覆盖主路径：创建 run、规划 DAG、调度 attempt、执行任务、验证结果、从 lease loss 恢复、等待用户输入，并在 UI 中查看完整过程。

When：适用场景

并非所有任务都需要进入 orchestration。更适合使用该路径的任务包括：

需要跨重启继续的 research / coding work
多步骤任务，某一步失败不应该让整个 run 作废
需要验证之后才能接受结果
需要浏览器或容器现场跨 attempt 保存
中途可能需要等待用户输入

不适合进入 orchestration 的任务应继续留在 chat/runtime 路径中，例如一次性问答、短工具调用、无需验证的轻量操作、没有跨步骤状态的任务。Orchestration 会引入额外状态、daemon 和 UI 面，只有当任务价值足以覆盖这些成本时才值得使用。

典型生命周期如下：

created -> running -> completed
created -> running -> waiting_human -> running -> completed
created -> running -> failed
created -> running -> cancelling -> cancelled

Where：代码与部署边界

主要代码位置：

internal/orchestration：持久化状态机和 service API
internal/orchestrationexec：planner、replanner、worker runtime、verifier runtime
cmd/orchestrator：控制平面 daemon
cmd/workerd：worker daemon
cmd/verifyd：verifier daemon
internal/orchestrationbus：event/fact bus contract 和 JetStream 实现
internal/orchestrationoutbox：committed event publisher
internal/orchestrationfacts：fact validator/consumer
internal/orchestrationblackboard：可重建的运行期视图
internal/orchestrationenv：environment session runtime
internal/handlers/orchestration.go：HTTP API
apps/web/src/pages/orchestration：DAG、Flow、inspector UI

Compose 部署新增 nats、orchestrator、workerd 和 verifyd。这些进程对应不同职责：API server 负责产品入口，orchestrator 负责控制平面，worker 和 verifier 负责任务执行与验证。拆分进程可以明确故障边界和扩缩容边界。

How：核心设计原则

这个设计要解决的不是“怎么多跑几个 worker”，而是长任务在失败、重启和重新规划之后仍然能解释清楚。原则按四类约束组织：事实写在哪里、谁推进状态、失败后怎么恢复，以及哪些东西只用于观测或运行现场。

状态与历史

Postgres 是事实来源。run、task、intent、attempt、verification、checkpoint 和 committed event 的最终状态都在 Postgres。NATS、blackboard、env session 和 daemon 内存可以丢失，但不能改变已提交事实。
历史用 append-and-supersede 表达。retry 创建新的 attempt，replan 追加新的 task 并 supersede 旧分支，terminal row 不会被重新打开。这样 UI、审计和后续 eval 才能解释任务为什么走到当前状态。
event 是权威时间线。committed run event 记录已经发生的状态迁移；worker/verifier fact 只是运行时观察结果，不能替代 event。

控制面与执行面

API server 不跑长任务控制循环。cmd/agent 是用户/API/UI 边界，负责鉴权、参数校验、读路径和用户触发的写操作。控制面不可用时，mutating orchestration API 返回 503，而不是静默降级。
Orchestrator 消费 intent 推进状态。start、replan、checkpoint resume、attempt finalize 等控制动作先落为 orchestration intent，再由 cmd/orchestrator 消费。这样控制动作可以恢复、重试，也可以被审计。
Worker 和 verifier 只处理 task-scoped runtime。cmd/workerd 执行 attempt，cmd/verifyd 验证 completed attempt。它们不决定整个系统的事实来源，只提交受 lease/fence 约束的状态迁移。

恢复与并发控制

所有运行时写入都要 fence。attempt、verification、worker 和 env session 均携带 lease token 或 epoch。过期执行者即使还活着，也不能继续提交写入。
恢复逻辑读持久化状态，而不是读内存。worker 离线、lease 过期、verification 中断、checkpoint 超时后，recovery loop 从 Postgres 判断应该继续、重试、等待、取消还是失败。
失败优先局部处理。失败先落在 task、attempt 或 verification 层。系统可以局部 retry、触发 replan、等待用户输入；只有策略耗尽或错误不可恢复时，run 才进入 failed。

运行时契约与观测

NATS 是传输，不是状态源。它负责 runtime fact 和 committed event delivery。NATS 丢失不能改变 Postgres 中已提交的事实。
Fact consumer 当前只做校验。它读取 worker/verifier 发布的 fact，并对照 Postgres 发现 orphan、stale claim 和 identity mismatch。它不是 API server 的下游，也不是当前的权威 writer。
Blackboard 和 env session 是可丢失的运行期对象。blackboard 可以从 Postgres 重建；env session 保存浏览器或容器现场，并通过 lease/fence 支持 hold、resume 和 reclaim。
验证与 UI 是信任面的一部分。worker output 不直接等于成功。Verifier、committed events、runtime facts 和 Web inspector 共同提供结果检查、失败归因、局部重试和后续经验沉淀的依据。

How it works：运行模型

这里按语义拆成多张图，而不是把所有关系塞进一张图：组件边界图说明谁负责什么，创建链路图说明用户请求如何落库，生命周期图说明 run/task 状态如何流转，runtime facts 图说明旁路观测如何校验。

组件边界

这张图只回答组件职责，不表达生命周期。cmd/agent 是用户和 Web 的 API 边界；cmd/orchestrator 是长任务控制面；cmd/workerd 和 cmd/verifyd 是任务运行时；Postgres 是唯一事实来源；NATS、blackboard 和 env session 是运行期契约，不是权威状态。

flowchart LR
    entry["User-facing API\nUser Bot / Web UI / cmd/agent"]
    postgres["Postgres\nauthoritative run state"]
    control["Control plane\ncmd/orchestrator"]
    runtime["Task runtime\ncmd/workerd / cmd/verifyd"]
    contracts["Runtime contracts\nNATS / Blackboard / Env sessions"]

    entry -->|"create, query, cancel, retry"| postgres
    control -->|"plan, schedule, recover"| postgres
    runtime -->|"attempt and verification transitions"| postgres

    control -->|"assigns work through persisted state"| runtime
    runtime -->|"publishes facts, uses execution context"| contracts
    control -->|"publishes events, manages env leases"| contracts

创建任务链路

用户仍然从 bot 对话入口发起任务。这个过程是请求时序，不是组件依赖图：User Bot 调用 orchestration tool，API server 负责鉴权和写入请求，最终在 Postgres 中创建 run、root task 和 start_run intent。

sequenceDiagram
    actor User
    participant Bot as User Bot
    participant Tool as orchestration tool
    participant API as API server cmd/agent
    participant DB as Postgres

    User->>Bot: ask for a long-running task
    Bot->>Tool: create orchestration run
    Tool->>API: POST CreateRun
    API->>API: validate request and authorize caller
    API->>DB: insert run, root task, start_run intent
    DB-->>API: run handle
    API-->>Tool: run_id and snapshot_seq
    Tool-->>Bot: run created

Run 生命周期

生命周期应该画成状态机。Orchestrator 推进 run 和 task 的权威状态；worker/verifier 只是通过带 lease/fence 的状态迁移提交执行和验证结果。失败不一定终止整个 run，它可能进入局部 retry、replan 或等待用户输入。

stateDiagram-v2
    [*] --> Created
    Created --> Planning: start_run intent
    Planning --> Running: DAG committed
    Planning --> Failed: planner exhausted

    Running --> Dispatching: ready task claimed
    Dispatching --> AttemptRunning: worker lease acquired
    AttemptRunning --> Verifying: attempt completed
    AttemptRunning --> Retrying: attempt failed and retryable
    AttemptRunning --> Failed: attempt failed and terminal

    Verifying --> TaskCompleted: result accepted
    Verifying --> Retrying: verifier requests retry
    Verifying --> Replanning: verifier requests replan
    Verifying --> WaitingHuman: checkpoint required
    Verifying --> Failed: verification failed and terminal

    Retrying --> Dispatching: new attempt
    Replanning --> Planning: replan intent
    WaitingHuman --> Running: checkpoint resolved
    WaitingHuman --> Failed: checkpoint timeout

    TaskCompleted --> Running: more ready tasks
    TaskCompleted --> Completed: all tasks completed

    Running --> Cancelling: cancel requested
    Dispatching --> Cancelling: cancel requested
    AttemptRunning --> Cancelling: cancel requested
    WaitingHuman --> Cancelling: cancel requested
    Cancelling --> Cancelled

    Completed --> [*]
    Failed --> [*]
    Cancelled --> [*]

Runtime facts 与观测

runtime facts 是旁路观测，不是当前的主状态写入路径。Worker 和 verifier 把 attempt/verification fact 发到 NATS；fact consumer 读取这些 fact，并对照 Postgres 校验 orphan、stale claim 和 identity mismatch。当前 fact consumer 是 read-only validation，不是 API server 的下游，也不是权威 writer。

flowchart LR
    worker["Worker\ncmd/workerd"] --> facts["NATS JetStream\nruntime facts"]
    verifier["Verifier\ncmd/verifyd"] --> facts
    facts --> consumer["Fact consumer\nread-only validation today"]
    consumer -.-> postgres["Postgres\nauthoritative state"]

    outbox["Committed event outbox"] --> events["NATS JetStream\ncommitted events"]
    events --> webUi["Web UI stream\nDAG / Flow / inspector"]
    postgres --> blackboard["Blackboard\nrebuildable runtime view"]

Postgres 是事实来源，保存 run、task、intent、attempt、verification 和 committed event。NATS 只负责 runtime fact 和 committed event delivery；API server 不依赖 NATS 启动，NATS 丢失也不应改变已提交事实。Blackboard 是可重建的 runtime view，Env session 是浏览器或容器现场的运行时绑定，二者都不作为权威状态。

这个 PR 包含什么

持久化执行内核

新增 Postgres orchestration schema，覆盖 run、task、dependency、intent、attempt、result、checkpoint、verification、event、worker lease 等核心对象。
用 orchestration intent 推进 start run、replan、checkpoint resume 和 attempt finalize，避免把长任务控制逻辑放在 API 请求生命周期里。
建立 append-and-supersede 的历史模型：retry 追加 attempt，replan 追加新 task 并 supersede 旧分支，已终止记录不被原地重开。

控制面与任务运行时

新增 cmd/orchestrator，负责 planning、scheduling、recovery、verification recovery、event outbox、fact consumer 和 env reclaim。
新增 cmd/workerd 与 cmd/verifyd，分别执行 task attempt 和验证 completed attempt。
引入 worker / verifier claim fencing，过期 lease 或旧 epoch 的执行者不能继续提交状态。
覆盖 HITL checkpoint 的创建、resolve、timeout 和 resume 路径。

API 与产品面

新增 orchestration HTTP API：创建 run、读取 run/task/inspector、取消 run、停止或重试 task、处理 checkpoint。
API server 保持产品入口职责，不依赖 NATS 启动；需要控制面的写接口在 orchestrator 不可用时返回 503。
Web orchestration 页面提供 DAG、Flow timeline、task inspector、checkpoint、artifact、execution span 和 env panel。
生成并更新 OpenAPI spec 与 TypeScript SDK，让 Web UI 通过同一套 API surface 调用编排能力。

运行时契约

引入 NATS JetStream bus contract 和 Compose wiring，用于 runtime fact 与 committed event delivery。
新增 committed event outbox，把已提交事件发布给运行时和 UI 订阅路径。
新增 runtime fact envelope 与 fact validation。当前 fact consumer 是 read-only validation，用于发现 orphan、stale claim 和 identity mismatch。
新增 blackboard store/writer/CAS contract，作为可重建的运行期视图。
新增 EnvResource / EnvSession / EnvBinding / EnvSnapshot 初版 runtime，用于描述浏览器或容器现场的申请、保持、恢复和回收。
新增 irreversible action approval token fencing，为不可逆副作用预留审批边界。

RFC scope

这不是完整 RFC。本 PR 落地的是 core runtime 和 UI 基础。Quota / admission、artifact reservation、experience learning 仍属于后续工作。

Test plan

本分支已运行：

Kernel / service / handler behavior:
- go test ./cmd/agent ./cmd/orchestrator ./internal/orchestration ./internal/orchestrationexec ./internal/handlers
- control-plane split commit 的 pre-commit Go lint 和 Go test
Deployment wiring:
- docker compose -f docker-compose.yml config --quiet
- docker compose -f devenv/docker-compose.yml config --quiet
Consistency:
- git diff --check

Adds a Writer wrapper around the blackboard Store that codifies the PLAN.md Stage 2.2 ownership table: - Workers may only write under their own task scope and may not touch the verifier namespace. - Verifiers may only write the verifier namespace at run or task scope. - Orchestrator may write any scope. result.* writes always require CompareAndSwap. The Writer rejects bare Put on result.* with ErrCASRequired, and CompareAndSwap fences stale writers whose ClaimEpoch is older than the value already in the store (ErrStaleWriter), even when the revision matches. WriterIdentity bundles the writer_type / writer_id / task_id / attempt_id / claim_epoch the authorisation layer needs and is validated up front by NewWriter. A small Reader wrapper mirrors Writer for symmetry in dependency injection. Backends are still free to expose Store directly. Tests cover orchestrator/worker/verifier authorisation, foreign-task rejection, cross-namespace rejection, CAS-required Put rejection, stale claim_epoch fencing, revision conflict propagation, and the Reader round-trip.

Implements Store on top of a NATS JetStream KV bucket. The bucket name defaults to MEMOH_ORCH_BLACKBOARD; the entire orchestration runtime view lives in one bucket with run/task scopes namespaced inside via the canonical bb.{scope}.{owner}.{namespace}[.path...] key form. Revisions are passed straight through to JetStream so CompareAndSwap maps to the bucket's optimistic-concurrency primitives without any client-side emulation. Insert (expected==0) maps to KV Create; non-zero expected maps to KV Update; both surface CAS conflicts as ErrRevisionConflict regardless of whether the underlying error is ErrKeyExists or APIError(StreamWrongLastSequence). List opens a filtered watcher over prefix and prefix.>, drains until JetStream signals initial-sync complete, and rebuilds canonical Keys from the raw KV key strings. Delete leaves a NATS tombstone so the entry is hidden from List and surfaced as ErrNotFound on Get. A small factory selects in-memory or JetStream based on an empty URL, mirroring the orchestrationbus factory shape so callers can stay on the in-memory backend in tests and stand-alone CLI tools. Round-trip, CAS, list, and delete tests run against an actual NATS server when TEST_NATS_URL is set (mirrors the TEST_POSTGRES_DSN gating already used by other Postgres-backed tests). Buckets are deleted on test cleanup so devenv state stays clean.

…spatch Wires the kernel into the Stage 2 blackboard runtime view. - Service.SetBlackboardStore registers an optional Store. When unset, every blackboard call site short-circuits and behavior is identical to the pre-Stage-2 kernel; tests and callers that have not opted in see no change. - On task completion the kernel publishes bb.task.{task_id}.result.summary as the orchestrator-class writer. CAS keeps repeated commits idempotent without threading the worker's claim epoch through every kernel call site. Postgres remains the authoritative copy and blackboard publish failures are logged only. - On task dispatch the kernel snapshots the live revisions of run-scope context plus the dependencies' result.* keys into orchestration_input_manifests.captured_blackboard_revisions, so verifier replay can later reconstruct the same view. cmd/agent provides the JetStream- or in-memory-backed Store via the existing [nats] config block and wires it into the kernel through an fx.Invoke. Single-process deployments and tests stay on the in-memory backend through the empty-URL factory fall back. Integration coverage: TestIntegrationBlackboardCaptureAndPublish drives a full producer/consumer DAG through the kernel and asserts both the publish path (root + producer result entries appear in the store) and the capture path (consumer's manifest carries the producer's blackboard revision).

Adds the recovery primitive and admin entry point that close out the Stage 2 contract. After a JetStream KV bucket loss, an operator can invoke this against a quiesced run to repopulate the runtime view from the authoritative Postgres copy without manual intervention or replay. Service.RebuildBlackboard validates run access via the standard caller identity, then walks two Postgres sources: - orchestration_runs.goal/input/control_policy/source_metadata writes bb.run.{run_id}.context.goal as the run-scope snapshot a worker would see at dispatch. - orchestration_task_results joined with the producing attempt writes bb.task.{task_id}.result.summary entries, matching the payload shape the live publish path produces during CompleteAttempt. Each write goes through the orchestrator-class Writer with CAS against the current revision, so concurrent live writes still reject the rebuild attempt instead of silently overwriting fresher state. Per-key errors are counted in the result and logged but do not abort the rebuild; callers can compare write_errors vs the totals to decide whether to retry. When no Store has been wired the call is a no-op that returns blackboard_configured=false so HTTP callers can detect the deployment shape from the response rather than guessing. The new HTTP route POST /orchestration/runs/{run_id}/blackboard/rebuild is declared on the existing orchestration handler interface, gets a fake binding in the handler tests, and ships through swag + the TypeScript SDK regenerated by openapi-ts. Integration coverage: - TestIntegrationRebuildBlackboardRecoversAfterStoreWipe wipes the in-memory store mid-run, calls RebuildBlackboard, and asserts both run-context and task-result entries reappear with the orchestrator writer identity. - TestIntegrationRebuildBlackboardWithoutStoreReturnsConfiguredFalse pins the no-store contract so deploys without NATS keep returning the same negative signal.

…ointer Walk PLAN.md through what actually shipped: namespace layout, writer ownership, frozen InputManifest capture at dispatch, post-commit result.summary publish, and the Postgres-backed RebuildBlackboard with its admin endpoint and SDK binding. Call out the two open follow-ups (JetStream wipe-and-recover blackbox harness, artifact/verifier rebuild) without conflating them with the contract that just landed, and roll the Next Step pointer forward to Stage 3 env sessions since Stage 2 CAS now unblocks moving control transitions onto the bus.

Lay down the five tables that anchor the env session runtime per RFC.md: env_resources describe leasable templates (container image, browser context, etc.) tenant-scoped with a capacity, env_sessions track concrete leased instances and carry lease_epoch fencing analogous to attempt claim_epoch, env_lease_reservations carry the admission ticket through reserve/commit/abort so the scheduler can queue when capacity is saturated, env_bindings map a session to the task/attempt currently using it (with a partial unique index that lets a session outlive a single attempt only when held for HITL resume), and env_snapshots record point-in-time captures keyed by session for verifier replay and drift detection. Indexes back the lookups the runtime layer will need next: per-resource status scans for capacity decisions, lease expiry scans for the reclaim loop, attempt/run lookups for binding inspection, and the prioritised pending queue on reservations. The active-binding partial unique index encodes the invariant that an active or held session has at most one binding at a time. The incremental migration in 0082_add_orchestration_env_runtime is mirrored in 0001_init.up.sql per the project rule. Both up and down were exercised end-to-end against a fresh postgres database (memoh-server migrate up / down / up). sqlc generated Go bindings for each new table and a focused query set covering create / get / list / update lifecycle plus the lease expiry sweep and the pending reservation queue. No runtime code uses these tables yet. S3-B introduces the internal/env package and the EnvManager interface that maps these rows into reserve/commit/abort/bind/snapshot/release semantics.

Add the package that wraps the Stage 3 env_* tables behind a thin runtime surface. The Manager owns the durable state machine — register a Resource, AcquireSession (which writes a reservation row, inserts a session row in 'reserved' state, calls a Backend.Allocate out of band, then commits both), CreateBinding / HoldBinding / ResumeBinding / ReleaseBinding for task→session attachment, and CaptureSnapshot / RenewSessionLease / ReleaseSession / ReclaimExpiredSessions for the rest of the lifecycle. Every state transition validates a (lease_token, lease_epoch) tuple against the session row, mirroring the orchestration kernel's claim_epoch model so a stale holder cannot mutate state behind the kernel's back. Backend is the small driver-facing contract (Kind / Allocate / Snapshot / Release). Backends register against a BackendRegistry by kind; the manager refuses to allocate against an unregistered kind (ErrBackendUnavailable) so a misconfigured deployment fails loud at the dispatch boundary rather than silently fall back to an in-memory shim. NoopBackend ships in this commit for tests and for single-process deployments that have no real env runtime wired in yet — Stage 3-C and 3-D add the container and browser drivers. Notable design choices documented in code: - Acquire is two-phase against Postgres (reservation+session locked in step 1, backend.Allocate runs out of band, commit lands in step 2) so a slow backend never holds row locks. A crash between steps leaves rows the reclaim sweep can finalise. - ResumeBinding bumps lease_epoch and rotates lease_token — the fence that separates a held HITL attempt from its resumed successor. - Capacity is enforced with a SELECT count + CHECK at the reservation step. Stage 4 admission queueing will swap that for fairness scheduling; today the call returns ErrCapacityExceeded and the caller decides whether to back off. - Bindings have a partial unique index on (session_id) WHERE status IN ('active','held'), so the schema enforces the invariant that an active or held session owns at most one binding at a time. Renamed the package directory to internal/orchestrationenv after discovering env/ is .gitignored at the repo root for python venv conventions; the import path matches the Stage 1 / Stage 2 lineage (orchestrationbus, orchestrationblackboard). Tests cover happy-path acquire/release, capacity exceeded, stale lease rejection on release, hold/resume epoch+token rotation, snapshot capture and listing, expired-session reclaim, and duplicate binding rejection. Each test creates its own postgres database and runs the full migration set so coverage exercises the real schema, including the deferred root_task_id FK on orchestration_runs.

Wire the orchestrationenv.Backend implementation for KindContainer resources. The backend depends on a small Runtime interface (PullImage / CreateContainer / StartContainer / StopContainer / DeleteContainer / CommitSnapshot) that mirrors the subset of internal/container.Service used at the env-session layer. Keeping the dependency narrow means this package stays unit-testable with a fake runtime — the cmd/agent wiring (Stage 3-E) is what binds it to containerd, docker, or apple drivers in production. Behaviour highlights: - Allocate respects an image_pull_policy (always / if_not_present / never) before CreateContainer, then starts the workload. A start failure triggers best-effort DeleteContainer so a half-created container does not leak the capacity slot. - Container IDs and storage keys are derived deterministically from the env_session_id ("envs-<id>"/"envs-rw-<id>") so retries reattach to the same container/storage and operators can correlate containers back to env sessions at a glance via the memoh.orchestration_env.* labels stamped onto every container. - Snapshot maps onto SnapshotService.CommitSnapshot. Backends that return container.ErrNotSupported (apple virtualization today) are surfaced as runtime-ref-only results with an "unsupported": true flag so the env manager can still record metadata for inspector views; real failures propagate to the caller. - Release stops the container, tolerates stop failures (the runtime may have already exited), and surfaces delete failures so the manager can decide whether to retry. Container.ErrNotFound at either step is treated as success since the runtime intent (gone) is satisfied. ResourceConfig fields the backend honours today: image / image_ref, storage_driver / snapshotter, cmd, env, workdir, user. Stage 3-E will extend this as the kernel adds mount and network requirements. Tests cover the happy path, missing-image rejection, start-failure cleanup, snapshot success and unsupported and error paths, release under stop failure / delete failure / not-found, and the never-pull policy.

Wire the orchestrationenv.Backend implementation for KindBrowser resources. Allocate POSTs to the existing apps/browser (Bun/Elysia/Playwright) gateway's /session endpoint and persists the returned ws_endpoint, gateway_session_id, and session_token into the env session's runtime handle so the worker can drive Playwright remotely through the gateway. Release closes the gateway session via DELETE /session/:id?token=... The package depends on a small Gateway interface so unit tests stay independent of the real Bun service. NewHTTPGateway ships the production client and matches the on-the-wire schema documented in apps/browser/src/modules/session.ts; tests round-trip against a real httptest.Server to confirm the request/response shapes match. Snapshot is intentionally minimal in this stage. The browser gateway exposes no native snapshot endpoint — capturing browser state means playing Playwright scripts against the live session, which is a worker-side action rather than a backend primitive. The backend returns a stable bookkeeping ref (gateway_session_id, ws_endpoint, snapshot kind) marked unsupported=true so the env manager still records snapshot rows for inspector views; Stage 3-I replaces this with real cookie / storage / screenshot capture once drift detection is designed. Resource config knobs honoured today: core (chromium / firefox), ttl_ms, context_config (passed through verbatim). bot_id sent to the gateway is derived from env_session_id with a configurable "envs-" prefix so env-managed sessions stay distinguishable from real bot sessions in gateway logs and per-bot quotas. Tests cover happy-path allocate, default-core fallback, gateway error propagation, snapshot bookkeeping, release with present and empty handles, release error propagation, plus HTTP gateway round-trip and HTTP error status handling.

…anifests Add two JSONB columns that Stage 3-E uses to drive env session reservation. orchestration_tasks.env_preconditions captures what the planner declared for the task ("does this task need a container or browser session, and which kind?"). orchestration_input_manifests. captured_env_preconditions captures what the kernel actually resolved at dispatch time, so a verifier replay can rebuild the same lease context the worker saw. Both columns default to '{"required": false}' so the column stays NOT NULL without breaking existing rows or creating a NULL special case in readers. defaultEnvPreconditionsJSON feeds that sentinel into every CreateOrchestrationTask and CreateOrchestrationInputManifest call site (root task, planner-emitted child tasks, dispatch manifest, plus all integration test seeds), so behaviour is unchanged in this commit — the columns exist, sqlc has binding code for them, and nothing reads them yet. Subsequent S3-E commits add the planner contract, the dispatch hook that calls envManager.AcquireSession / CreateBinding, the manifest hash inclusion, the cmd/agent FX wiring, and the admin HTTP CRUD surface for env resources.

Promote env_preconditions from a deferred Stage 3 idea to a mandatory field on the planner contract. Every child task the start-run planner or replanner emits must now declare whether it depends on a leasable runtime environment. Required=false marks pure-LLM steps and remains the common case; Required=true demands kind ∈ {container, browser} and a non-empty resource_name so the kernel can later resolve the operator-managed env_resource and reserve a session before dispatch. Domain side: - internal/orchestration/types.go gains an EnvPreconditions struct (with Required/Kind/ResourceName/Mode/EffectClass/Metadata) plus EnvPreconditionsKind* and EnvPreconditionsEffect* constants that mirror the orchestration_env_resources / action ledger CHECK vocabulary. PlannedTaskSpec and Task now carry it as a first-class field, and toTask projects the env_preconditions JSONB column back into the struct. - runtime.go threads the planner-supplied value through the materialise path (replacing the default sentinel from S3-E.1), through the replacement_plan payload, and through plannedChildTasksFromSpecs / plannedChildTasksFromReplacementPlan. decodeEnvPreconditionsObject and normalizeEnvPreconditions centralise the structural validation: required=false strips every optional field, required=true enforces kind/resource_name and rejects unknown keys. Planner contract side: - internal/orchestrationexec/start_run_planner.go now requires env_preconditions on every emitted child task, with requiredPlannerEnvPreconditions / decodePlannerEnvPreconditions performing the same validation as the kernel-side decoder. - The system prompts for both the start-run planner and the replanner explicitly describe the env_preconditions schema and remind the model to mark required=true only for tasks that touch an external runtime. Tests: - New llm_runtime_test.go cases cover the happy path, the env-bound container shape, and three rejection paths (missing field, invalid kind, required=true without resource_name). - TestDecodeStartRunPlannerPayloadValidatesChildTasks and TestDecodeReplanPlannerPayloadUsesStrictChildTaskSchema now pass env_preconditions through so they exercise the new decoder. - The blackbox harness LLM stub emits env_preconditions=false on every fake child task it returns, keeping the runtime smoke tests green under the tighter contract. Replan payloads that pre-date Stage 3-E may still omit the field; plannedChildTaskEnvPreconditions falls back to required=false in that case so historic replans stay replayable without a data migration.

…mpletion The kernel now drives the env session runtime end to end. When a planner emits env_preconditions.required=true the dispatcher resolves the resource_name on the EnvManager, acquires a session, creates a primary binding for the attempt, and persists the captured envelope into captured_env_preconditions on the dispatch manifest. CompleteAttempt releases both the binding and the underlying session once the attempt reaches a terminal state, and dispatch aborts run a best-effort compensation against the env manager so a failed dispatch does not strand a session. The reclaim sweep remains the authoritative backstop for any state that loses the race. EnvManager is exposed as a primitive-typed interface in internal/orchestration so the kernel does not depend on the orchestrationenv package, which lets cmd/agent (S3-E.4) plug in the real Manager without a circular import and lets unit tests substitute a fake. The dispatch path threads an internal envCapture struct through the manifest so a verifier replay can fence stale callers without re-resolving the planner-supplied resource. Tasks with required=false take a fast path that never touches the env manager, keeping pure-LLM dispatches byte-identical with the column default. Integration coverage exercises both branches against a real database with a fake EnvManager, asserting the call sequence, the lease fencing tuple, the manifest capture, and the post-completion release.

…sweeps The orchestration kernel now sees a real environment session manager when it runs as part of the unified server. cmd/agent constructs the env backend registry from the same container service and browser gateway the rest of the server uses, instantiates orchestrationenv.Manager from the shared Postgres pool, and adapts it through KernelAdapter so the kernel keeps consuming the primitive orchestration.EnvManager interface declared in S3-E.2. A periodic reclaim loop sweeps expired sessions on a 30-second cadence so that lease TTLs alone do not strand env_resources after a worker crashes between dispatch and release. The default lease TTL stays at the kernel's 30-minute floor and operators can override it via MEMOH_ORCHESTRATION_ENV_LEASE_TTL_SECONDS without rebuilding. KernelAdapter lives in internal/orchestrationenv so the kernel never has to import the env package. The container backend opts in via a runtime type assertion against the existing ctr.Service value: deployments whose container service does not satisfy the wider env runtime surface skip container env resources rather than refusing to boot, and deployments without a browser gateway skip the browser backend the same way. Missing backends surface as ErrBackendUnavailable on dispatch, which keeps the misconfiguration visible at the call site instead of papering over it.

Expose tenant-scoped env resource CRUD so operators can manage runtime templates before planner dispatch depends on them.

Classify orchestration actions and attach env session plus snapshot references so env-backed attempts are auditable from dispatch through release.

Hold env bindings across resume_held_env checkpoints and reattach them to the next attempt with a rotated lease so HITL resumes keep the same runtime safely.

Flatten planner and completion tool calls so LLM outputs fail in controlled validation paths, and keep browser automation modeled as an env with context/exclusive modes.

Add the orchestration sidebar, env resource pages, image management views, and updated run details so the frontend matches the expanded orchestration APIs.

Reuse the chat tool-call presentation in the run inspector so orchestration actions stream into a familiar, compact Act tab.

Use orchestration intent terminology across storage, runtime, API contracts, and generated clients so control intents are no longer modeled as planner-only lifecycle state.

Replace the monolithic orchestration run view with Vue Flow based DAG and timeline views, including lane-local time compression and clearer status metadata.

Use theme-aware neutrals for orchestration chrome while keeping selected tasks readable with a subtle primary accent.

Move orchestration control-plane loops into a dedicated orchestrator process so the API server no longer owns NATS-backed runtime coordination.

Consolidate the branch's orchestration schema changes into a smaller migration sequence so review and fresh database setup stay manageable.

Add task-scoped cancellation and retry APIs, including durable requeue support for failed start-run planning so planner failures can recover through the same task control path.

Place retry actions directly on failed DAG and flow task cards, and keep the environment inspector focused on the environment name and type.

sheepbox8646 added the size-xl label Apr 29, 2026

Fodesu force-pushed the orchestration branch from a94af88 to f278a82 Compare May 4, 2026 12:33

Fodesu marked this pull request as ready for review May 4, 2026 15:10

Fodesu requested review from chen-ran and sheepbox8646 May 4, 2026 15:10

chen-ran force-pushed the orchestration branch from d92b3bf to 91e2c53 Compare May 7, 2026 06:16

Fodesu force-pushed the orchestration branch from c51a118 to 22d98e4 Compare May 11, 2026 10:51

Fodesu added 23 commits May 18, 2026 01:25

feat(orchestration): implement orchestration control plane

a0464b0

feat(orchestration): implement runtime replan and workerd loop

0a08367

feat(orchestration): add verifier runtime and worker loop

b8dd54c

refactor(orchestration): refine attempt binding runtime

4135224

fix(orchestration): harden run barrier sibling pause handling

89493db

feat(orchestration): add run barrier resume and API coverage

8c10c24

feat(orchestration): add HTTP runtime blackbox coverage

53ab196

feat(orchestration): add llm runtime execution and action ledger

114d5c9

feat(agent): expose orchestration tool and control APIs

51a5212

feat(deploy): bundle orchestration workers in docker

f3822b7

feat(web): add orchestration run inspector

7285432

feat(orchestration): add run watch, hint, and retry controls

e091b77

fix(orchestration): tighten retry and hint runtime guards

f4636ab

feat(orchestration): add LLM start-run planner

e19db9f

docs(orchestration): add runtime plane plan

548caa8

feat(orchestration): tighten executor runtime boundary

2b38098

feat(orchestration): tighten planner output contract

ac9d0b7

feat(orchestration): add LLM replanning runtime

f8e3317

feat(orchestration): enforce planner runtime limits

aee4c78

fix(orchestration): harden executor replay recovery

b279ce6

feat(orchestration): add verifier retry policy

9f13303

feat(orchestration): delay automatic retry dispatch

5ab7b19

fix(agent): honor canceled mock generation context

5d6d308

Fodesu added 26 commits May 18, 2026 01:50

feat(orchestration): add env resource admin API

00aea7e

Expose tenant-scoped env resource CRUD so operators can manage runtime templates before planner dispatch depends on them.

feat(orchestration): record env actions in ledger

10d64dc

Classify orchestration actions and attach env session plus snapshot references so env-backed attempts are auditable from dispatch through release.

feat(orchestration): resume held env sessions after HITL

fef5170

Hold env bindings across resume_held_env checkpoints and reattach them to the next attempt with a rotated lease so HITL resumes keep the same runtime safely.

feat(orchestration): harden env execution and tool flows

ae98597

Flatten planner and completion tool calls so LLM outputs fail in controlled validation paths, and keep browser automation modeled as an env with context/exclusive modes.

feat(orchestration): refresh run and env management UI

cc63808

Add the orchestration sidebar, env resource pages, image management views, and updated run details so the frontend matches the expanded orchestration APIs.

feat(orchestration): show node actions like chat tools

b1912dd

Reuse the chat tool-call presentation in the run inspector so orchestration actions stream into a familiar, compact Act tab.

refactor(orchestration): rename planning intents

7a47909

Use orchestration intent terminology across storage, runtime, API contracts, and generated clients so control intents are no longer modeled as planner-only lifecycle state.

feat(orchestration): rebuild run graph views

ee41d30

Replace the monolithic orchestration run view with Vue Flow based DAG and timeline views, including lane-local time compression and clearer status metadata.

style(orchestration): soften run view accents

e77f26d

Use theme-aware neutrals for orchestration chrome while keeping selected tasks readable with a subtle primary accent.

refactor(orchestration): split control plane daemon

15fbade

Move orchestration control-plane loops into a dedicated orchestrator process so the API server no longer owns NATS-backed runtime coordination.

refactor(orchestration): squash migrations

9417b8c

Consolidate the branch's orchestration schema changes into a smaller migration sequence so review and fresh database setup stay manageable.

feat(orchestration): support task retry and cancel

46e3305

Add task-scoped cancellation and retry APIs, including durable requeue support for failed start-run planning so planner failures can recover through the same task control path.

feat(orchestration): move task controls into run views

7c6b3c5

Place retry actions directly on failed DAG and flow task cards, and keep the environment inspector focused on the environment name and type.

chen-ran force-pushed the orchestration branch from 22d98e4 to 7c6b3c5 Compare May 17, 2026 17:56

fix(orchestration): remove browser gateway

12f2903

danielwpz mentioned this pull request May 18, 2026

🛰️ 飞书生态开源项目雷达 · 2026-05-18 lark-radar/lark-radar#8

Open

sheepbox8646 force-pushed the main branch from 5653320 to a9fe3ca Compare May 22, 2026 11:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(orchestration): introduce orchestration system#408

feat(orchestration): introduce orchestration system#408
Fodesu wants to merge 61 commits into
memohai:mainfrom
Fodesu:orchestration

Fodesu commented Apr 26, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Fodesu commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why：长复杂任务需要执行层

可观察性

可恢复性

局部重试

验证与审计

Who：组件职责

What：这个 PR 引入的运行模型

When：适用场景

Where：代码与部署边界

How：核心设计原则

状态与历史

控制面与执行面

恢复与并发控制

运行时契约与观测

How it works：运行模型

组件边界

创建任务链路

Run 生命周期

Runtime facts 与观测

这个 PR 包含什么

持久化执行内核

控制面与任务运行时

API 与产品面

运行时契约

RFC scope

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fodesu commented Apr 26, 2026 •

edited

Loading